This step of the BDC workflow extracts the collection year, whenever possible, from complete and legitimate date information, and flags records whose collecting year is dubious (e.g., 07/07/10), illegitimate (e.g., 1300, 2100), or not supplied (e.g., 0 or NA).
Important:
The results of the VALIDATION tests used to flag data quality are appended in separate fields in this database and retrieved as TRUE or FALSE, where TRUE indicates correct records and FALSE indicates potentially problematic or suspect records.
You can install the ‘bdc’ package from GitHub with:
if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results.
bdc::bdc_create_dir()

Read the database created in the [**Space**](https://brunobrr.github.io/bdc/articles/03_space.html) step of the BDC workflow. It is also possible to read any dataset containing the **required** fields to run the workflow ([more details here](https://brunobrr.github.io/bdc/articles/integrate_datasets.html)).
database <-
  qs::qread("Output/Intermediate/03_space_database.qs")

Standardization of character encoding.
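# Mark every character column as UTF-8 encoded.
for (i in seq_len(ncol(database))) {
  # `[[i]]` returns the column vector for both data frames and tibbles
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}

VALIDATION. This function flags records lacking event date information (e.g., empty or NA).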
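As an illustration of what a check of this kind amounts to, the base R sketch below marks a record FALSE when its date field is NA or blank. The toy data frame and the helper name `flag_empty_date` are hypothetical, and this is not the implementation used by the package; it only mirrors the TRUE/FALSE flag convention described at the top of this article. The package call for this step follows the sketch.

```r
# Toy records; the column name matches the field used in this workflow.
toy <- data.frame(
  verbatimEventDate = c("2010-07-07", "", NA, "  ", "1998"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE = a date was supplied, FALSE = missing or blank.
flag_empty_date <- function(x) {
  !is.na(x) & nzchar(trimws(x))
}

# Append the flag as a separate column, as the workflow does.
toy$.eventDate_empty <- flag_empty_date(toy$verbatimEventDate)
toy
```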
check_time <-
  bdc_eventDate_empty(data = database, eventDate = "verbatimEventDate")
#>
#> bdc_eventDate_empty:
#> Flagged 3179 records.
#> One column was added to the database.

ENRICHMENT. This function extracts a four-digit year from unambiguously interpretable collecting dates.
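To make "unambiguously interpretable" concrete, here is a minimal base R sketch that keeps a year only when the date string contains a standalone run of exactly four digits. The helper name `extract_year` and the regular expression are assumptions made for illustration; the package function may apply different or stricter rules. A two-digit date such as 07/07/10 yields NA because no four-digit year can be read from it.

```r
# Hypothetical helper: return the first standalone four-digit number, or NA.
extract_year <- function(x) {
  # Lookarounds ensure the four digits are not part of a longer digit run.
  start <- regexpr("(?<![0-9])[0-9]{4}(?![0-9])", x, perl = TRUE)
  hit <- !is.na(start) & start > 0
  year <- rep(NA_integer_, length(x))
  year[hit] <- as.integer(substring(x[hit], start[hit], start[hit] + 3L))
  year
}

toy_dates <- c("2010-07-07", "07/07/10", "15 March 1998", NA, "")
extract_year(toy_dates)
```

Records that return NA here keep their original date string; only the derived year is missing.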
check_time <-
  bdc_year_from_eventDate(data = check_time, eventDate = "verbatimEventDate")
#>
#> bdc_year_from_eventDate:
#> Four-digit year were extracted from 2933 records.

VALIDATION. This function identifies records with an illegitimate or potentially imprecise collecting year. The year can be out of range (e.g., in the future) or earlier than a threshold year supplied by the user (e.g., 1900). Older records are more likely to be imprecise due to the locality-derived geo-referencing process.
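The rule can be sketched in base R as follows. The helper `flag_year_in_range` is hypothetical. It rests on two assumptions that the text above does not settle: the threshold year itself is treated as valid, and missing years are left unflagged, on the view that they are already covered by the empty-date check.

```r
# Hypothetical helper: TRUE = plausible year, FALSE = out of range.
flag_year_in_range <- function(year,
                               year_threshold = 1900,
                               current_year = as.integer(format(Sys.Date(), "%Y"))) {
  # Assumption: NA years are not flagged by this test.
  is.na(year) | (year >= year_threshold & year <= current_year)
}

years <- c(1300, 1900, 1950, 2100, NA)
flag_year_in_range(years)
```

Raising `year_threshold` makes the test stricter, trading sample size for records that are less likely to carry imprecise coordinates.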
check_time <-
  bdc_year_outOfRange(data = check_time,
                      eventDate = "year",
                      year_threshold = 1900)
#>
#> bdc_year_outOfRange:
#> Flagged 12 records.
#> One column was added to the database.

Creating a column named “.summary” that summarizes the results of all VALIDATION tests. This column is FALSE if a record was flagged FALSE by any test (i.e., a potentially invalid or suspect record).
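In other words, the summary is a row-wise logical AND over the flag columns. The base R sketch below shows that logic on a toy data frame. Selecting flag columns by a leading dot is an assumption based on the column names used in this workflow, not a description of how the package selects them.

```r
# Toy data with two flag columns (TRUE = passed, FALSE = flagged).
toy <- data.frame(
  id = 1:4,
  .eventDate_empty = c(TRUE, FALSE, TRUE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE, TRUE)
)

# Assumption: flag columns are those whose names start with a dot.
flag_cols <- grep("^\\.", names(toy), value = TRUE)

# A record passes only if every individual test returned TRUE.
toy$.summary <- Reduce(`&`, toy[flag_cols])
toy
```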
check_time <- bdc_summary_col(data = check_time)
#> Column '.summary' already exist. It will be updated
#>
#> bdc_summary_col:
#> Flagged 3481 records.
#> One column was added to the database.

Creating a report summarizing the results of all tests.
report <-
  bdc_create_report(data = check_time,
                    database_id = "database_id",
                    workflow_step = c("prefilter", "taxonomy", "space", "time"))
#>
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report
#>
#> bdc_create_report:
#> Check the report summarizing the results of the taxonomy in:
#> Output/Report
#>
#> bdc_create_report:
#> Check the report summarizing the results of the space in:
#> Output/Report
#>
#> bdc_create_report:
#> Check the report summarizing the results of the time in:
#> Output/Report
report

Creating a histogram showing the number of records collected over the years.
bdc_create_figures(data = check_time,
                   database_id = "database_id",
                   workflow_step = "time")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures
Figure: Number of records sampled over the years.

Figure: Summary of all tests of the time step; note that some databases lack event date information.

Figure: Summary of all validation tests of the BDC workflow.

Save the original database containing the results of all data quality tests appended in separate columns.
Let’s remove potentially erroneous or suspect records flagged by the data quality tests applied in all steps of the BDC workflow to get a “clean”, “fitness-for-use” database.
output <-
  check_time %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#>
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .uncer_terms, .val, .equ, .zer, .cap, .cen, .urb, .otl, .gbf, .inst, .dpl, .rou, .eventDate_empty, .year_outOfRange, .summary
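The effect of this final step can be sketched in base R on a toy data frame: keep the rows where `.summary` is TRUE, then drop the flag columns. The toy data and the rule of identifying flag columns by a leading dot are assumptions for illustration; this is not the package's implementation.

```r
# Toy database with two flag columns and their summary.
toy <- data.frame(
  species = c("sp1", "sp2", "sp3", "sp4"),
  year = c(2010L, NA, 1300L, 1998L),
  .eventDate_empty = c(TRUE, FALSE, TRUE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE, TRUE),
  .summary = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep records that passed all tests, then drop the dot-prefixed flag columns.
keep_rows <- toy$.summary
keep_cols <- !startsWith(names(toy), ".")
clean <- toy[keep_rows, keep_cols, drop = FALSE]
clean
```

Filtering before dropping the columns matters: once `.summary` is removed, the information needed to select the clean records is gone.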